Exploring rates of speciation in the tree using BAMM

Author

Daniel Padfield

Published

February 13, 2023

Outline

This walk-through goes through the analysis of a BAMM run on our phylogenetic tree, which is short for Bayesian Analysis of Macroevolutionary Mixtures. This model identifies changes in speciation, extinction, and diversification rates on a tree without any knowledge of tip states or traits, and identifies the most likely number of transitions, where they are likely to occur, and outputs a bunch of other useful things.

From the online documentation of BAMM, we are following Section 8: Analysing BAMM Output with BAMMtools. We have previously ran bamm on our tree. Input information about that run here.

Progress and Results

  • We have plotted all the “core” shifts on the whole of the phylogenetic tree
  • Identifies lots of shift configurations, but all of them have low probability
  • Best model has at ~35 shifts
  • Have identified areas with a high density of potential shifts

Possibilities

  • Macroevolutionary cohort analysis. An introduction to this is here and explained in more detail here.
  • Do we need to re-run the model at different levels of sampled diversity to check if the results are robust?

Load in R packages

First we will load in R packages used and the metadata file used and wrangled in a previous walk-through.

Code
set.seed(42)

# load packages
library(here)
library(caper)
library(ggtree)
library(ggnewscale)
library(RColorBrewer)
library(patchwork)
library(ape)
library(phytools)
library(BAMMtools)
library(coda)
library(MetBrewer)
library(fastdivrate) # remotes::install_github("jonchang/fastdivrate")
library(nlme)
library(emmeans)
library(tidyverse)

# set where I am in the project
here::i_am('scripts/sequencing_rpoB/analyses/phylogenetics/post_bamm_analysis.qmd')

# read in habitat preference
d_habpref <- read.csv(here('data/sequencing_rpoB/phyloseq/myxococcus/habitat_preference/summary/habitat_preference_asv.csv'))

# read in phyloseq object and grab tax table
d_taxa <- readRDS(here('data/sequencing_rpoB/phyloseq/myxococcus/prevalence_filtered/ps_otu_asv_filt.rds')) %>%
  phyloseq::tax_table() %>%
  data.frame() %>%
  janitor::clean_names() %>%
  rownames_to_column('otu')

# create d_meta
d_meta <- left_join(select(d_habpref, otu, habitat_preference, num_present), select(d_taxa, otu:family))

Lineage through time plot

First we will look at a lineage through time plot of our ultrametric phylogenetic tree.

Code
# load in tree
tree <- ape::read.tree(here('data/sequencing_rpoB/raxml/trees/myxo_asv/myxo_asv_chronopl10.tre'))

# check is rooted
is.rooted(tree)
[1] TRUE
Code
# check is ultrametric
is.ultrametric(tree)
[1] TRUE
Code
# grab lineage through time data
d_ltt <-  ape::ltt.plot.coords(tree) %>%
  data.frame() %>%
  mutate(time2 = time + 1)

# create lineage through time plot
ggplot(d_ltt, aes(time2, N)) +
  geom_line() +
  theme_bw(base_size = 14) +
  labs(x = 'Relative time',
       y = 'Number of lineages')

Assess convergence of BAMM run

First we will assess convergence of our MCMC simulation.

Code
# read in mcmc output from bamm
mcmcout <- read.csv(here("data/sequencing_rpoB/bamm/bamm_asv_mcmc_out.txt"), header=TRUE)
max(mcmcout$generation)
[1] 29980000
Code
# discard some runs as burnin. We will discard the first 10% of samples
burnstart <- floor(0.3 * nrow(mcmcout))
postburn <- mcmcout[burnstart:nrow(mcmcout), ]

# calculate effective sample size
effectiveSize(postburn$N_shifts)
    var1 
433.7601 
Code
effectiveSize(postburn$logLik)
  var1 
246.27 

In general, we want the effective sample size values to be at least 200 (and 200 is on the low side, but might be reasonable for very large datasets). Both are now close to 200 with a 30% burnin.

Next we can look at the number of potential rate shifts.

Code
post_probs <- table(postburn$N_shifts) / nrow(postburn)
names(post_probs)
 [1] "8"  "9"  "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "20" "21" "22"
[16] "23" "26"
Code
edata <- getEventData(tree, eventdata = here('data/sequencing_rpoB/bamm/bamm_asv_event_data.txt'), burnin = 0.3)
Reading event datafile:  /home/padpadpadpad/Desktop/myxo_diversification/data/sequencing_rpoB/bamm/bamm_asv_event_data.txt 
        ...........
Read a total of 1500 samples from posterior

Discarded as burnin: GENERATIONS <  8980000
Analyzing  1051  samples from posterior

Setting recursive sequence on tree...

Done with recursive sequence
Code
shift_probs <- summary(edata)

Analyzed 1051 posterior samples
Shift posterior distribution:

         8    0.00095
         9    0.03000
        10    0.09300
        11    0.16000
        12    0.19000
        13    0.18000
        14    0.13000
        15    0.08200
        16    0.06100
        17    0.04200
... omitted 6 rows


Compute credible set of shift configurations for more information:
    See ?credibleShiftSet and ?getBestShiftConfiguration
Code
# plot these
ggplot(shift_probs, aes(shifts, prob)) +
  geom_col(col = 'black', fill = 'light grey') +
  theme_bw(base_size = 14) +
  labs(x = 'Number of shifts',
       y = 'Probability')

Code
# calculate 95% CIs for the number of shifts
n_shifts_ci <- tibble(mean_shifts = mean(postburn$N_shifts),
                      lower_ci = quantile(postburn$N_shifts, 0.025),
                      upper_ci = quantile(postburn$N_shifts, 0.975))
n_shifts_ci
# A tibble: 1 × 3
  mean_shifts lower_ci upper_ci
        <dbl>    <dbl>    <dbl>
1        12.9        9       18

The model has converged, and we also now have a narrow(ish) range of potential values for the number of range shifts (between 9 and 18).

The maintainers or BAMM suggest that (usually) the best overall model from a BAMM analysis is the model with the highest Bayes factor relative to the null model, \(M_{0}\), which has zero rate shifts. However, we do not have any samples of zero shifts in our postburn in sample 8! However, we do have zero shifts in our preburn-in samples as can be seen here 0.

We can therefore calculate Bayes factors from the mcmc_out.txt file. We are not going to have a burnin because otherwise we cannot sample the example of zero shifts.

Bayes factors greater than 20 generally imply strong evidence for one model over another; values greater than 50 are very strong evidence in favour of the numerator model. There is no definitive Bayes factor criterion for “significance”, but many researchers consider values greater than 12 to be consistent with at least some effect.

Code
# list file
mcmc_file = here("data/sequencing_rpoB/bamm/bamm_asv_mcmc_out.txt")

# calculate Bayes Factors
bayes_factors <- computeBayesFactors(mcmc_file, expectedNumberOfShifts=500, burnin=0)

# grab the columns for pairwise comparisons between 0 shifts and number of shifts
d_bayes_factors <- bayes_factors[,1] %>%
  data.frame() %>%
  rownames_to_column(var = 'n_shifts') %>%
  rename(., bayes_factor = `.`)

# we can rank bayes factors and then find the the difference between these
d_bayes_factors <- arrange(d_bayes_factors, desc(bayes_factor)) %>%
  mutate(diff = c(0, abs(diff(bayes_factor))),
         cum_diff = cumsum(diff))

head(d_bayes_factors)
  n_shifts bayes_factor     diff  cum_diff
1       12     298.0613  0.00000   0.00000
2       13     263.7628 34.29856  34.29856
3       11     246.3553 17.40744  51.70600
4       14     200.5316 45.82379  97.52979
5       10     160.1684 40.36314 137.89293
6       15     113.3466 46.82181 184.71474

We can see the best model with 12 shifts is >20 away from the next best model, implying strong evidence that the best fit to the data contains 12 shifts.

BAMMtools also has a function for visualising the prior and posterior simultaneously. This is useful to see what models are not being sampled in the posterior, and also to evaluate how far from the prior the posterior has moved.

Code
# use plotPrior to visualise the prior and posterior simultaneously
d_prior <- plotPrior(mcmcout, expectedNumberOfShifts=500, burnin = 0.3) %>%
  data.frame() %>%
  janitor::clean_names() %>%
  pivot_longer(cols = contains('probs'), names_to = 'type', values_to = 'prob', names_pattern = "(.*)_probs")

Code
ggplot(d_prior, aes(n_shifts, prob, fill = type)) +
  geom_bar(col = 'black', stat = 'identity', position=position_dodge()) +
  theme_bw(base_size = 14) +
  scale_fill_manual('', values = c('grey', 'black'), labels = c('posterior', 'prior')) +
  labs(x = 'Number of shifts',
       y = 'Probability') +
  theme(legend.position = c(0.2, 0.8))

We can see that our posterior distribution has shifted from the prior which further reinforces our conclusion that the model has converged.

Plotting results from bamm model

We can now try and plot the mean, model-average diversification rates at any point along every branch of a phylogenetic tree. The standard code to do this takes so long to run that right we will not evaluate the code.

Code
plot.bammdata(edata, lwd=2)

We can calculate the credible shift set that is the set of distinct shift configurations that account for 95% of the probability of the data. Core shifts are those that contribute appreciably to your ability to model the data. Non-core shifts are simply ephemeral shifts that don’t really contribute anything: they are simply what you expect under the prior distribution for rate shifts across the tree.

Code
# calculate credible shift set
d_css <- credibleShiftSet(edata, expectedNumberOfShifts = 500, threshold = 5, set.limit = 0.95)

# number of distinct configurations in the data
d_css$number.distinct
[1] 993
Code
# view more information about the credible set
summary(d_css)

 95 % credible set of rate shift configurations sampled with BAMM

Distinct shift configurations in credible set:  993

Frequency of 9 shift configurations with highest posterior probability:


   rank     probability cumulative  Core_shifts
         1 0.0019029496 0.001902950          8
         2 0.0019029496 0.003805899          9
         3 0.0019029496 0.005708849          9
         4 0.0019029496 0.007611798         10
         5 0.0019029496 0.009514748          9
         6 0.0019029496 0.011417697          8
         7 0.0009514748 0.012369172          9
         8 0.0009514748 0.013320647          8
         9 0.0009514748 0.014272122         10

...omitted 984 additional distinct shift configurations
from the credible set. You can access the full set from your 
credibleshiftset object
Code
# can plot this credible shift set
plot.credibleshiftset(d_css)
Omitted 984 plots

However, as can be seen from the summary, even the single best shift configuration has a really low posterior probability (0.0019). Although the best model supports 12 shifts, not all of these are core shifts. There are only 8-10 core shifts identified in the best model shift configurations here.

For some datasets with large numbers of taxa and rate shifts (e.g., trees with thousands of taxa), all shift configurations may have low probability. There are simply too many parameters in the model to allow a single shift configuration to dominate the credible set. An alternative approach is to extract the shift configuration that maximises the marginal probability of rate shifts along individual branches. This is very similar to the idea of a maximum clade credibility tree in phylogenetic analysis. BAMM has a function maximumShiftCredibility for extracting this shift configuration:

Code
# calculate max shift credibility
msc_set <- maximumShiftCredibility(edata, maximize='product')

# grab the best configuration and plot it
msc_config <- subsetEventData(edata, index = msc_set$sampleindex)
plot.bammdata(msc_config, lwd=2)
addBAMMshifts(msc_config, cex = 2)

This picks a model with 9 rate shifts over the whole tree.

We can try and plot these shifts using ggtree. First we will look at the distribution of speciation rates across the tree. We need to be careful about our colour scale to prevent it being misleading.

Here we also change the tip labels of our tree so they rematch with those from our metadata.

Code
# get mean phylorates that underly the colorised plot produced by plot.bammdata
# from here: https://groups.google.com/g/bamm-project/c/W6s38xzm6OU/m/LALF47xVS54J
#mbt <- getMeanBranchLengthTree(edata, rate = "speciation")

# get the mean branch lengths from the best tree configuration as identified from maximumShiftCredibility 
mbt2 <- getMeanBranchLengthTree(msc_config, rate = "ndr")

# get shift nodes from "best model"
shiftnodes <- getShiftNodesFromIndex(edata, index = msc_set$sampleindex)

# get tree
tree_bamm <- mbt2$phy

# get the edge lengths in a dataframe
d_tree_bamm <- data.frame(tree_bamm$edge, edge_num=1:nrow(tree_bamm$edge), edge_length = tree_bamm$edge.length)
colnames(d_tree_bamm)=c("parent", "node", "edge_num", 'edge_length')

# transform these to log for the colour scale
d_tree_bamm <- mutate(d_tree_bamm, log_edge_length = log(edge_length))

# visualise edge lengths that represent speciation rates of each branch
# similar to visualising colour breaks http://bamm-project.org/colorbreaks.html#how-do-i-plot-these-histograms
p1 <- ggplot(d_tree_bamm, aes(edge_length)) +
  geom_histogram(col = 'black', fill = 'light grey') +
  theme_bw()

p2 <- ggplot(d_tree_bamm, aes(log(edge_length))) +
  geom_histogram(col = 'black', fill = 'light grey') +
  theme_bw()

p1 + p2

We will use the log edge lengths/rates of diversification because they result in a more even spread of diversification rates across a range of values.

Now we can plot the tree. We will try plot the family assignment around the edge of the tree. We also need to assign the colours the same as before.

Code
# constrained families
constrained_families <- c('Myxococcaceae', 'Vulgatibacteraceae', 'Anaeromyxobacteraceae', 'Polyangiaceae', 'Sandaracinaceae', 'Nannocystaceae', 'Haliangiaceae')

# reorder d_meta so the tip labesl link to the order of the tips in the tree
d_meta <- tibble(tip_label = tree_bamm$tip.label) %>%
  left_join(., rename(d_meta, tip_label = otu))

# find the mrca of each of the constrained families
d_meta2 <- filter(d_meta, family %in% constrained_families) %>%
  dplyr::select(family, tip_label) %>%
  group_by(family) %>%
  nest() %>%
  mutate(mrca = NA)

for(i in 1:nrow(d_meta2)){
  d_meta2$mrca[i] <- findMRCA(tree_bamm, tips = d_meta2$data[[i]]$tip_label)
}

d_meta2 <- dplyr::select(d_meta2, family2 = family, mrca) %>%
  mutate(blank_label = '')

# add colour for the different families
cols <- c(colorRampPalette(brewer.pal(11, "Spectral"))(nrow(d_meta2)))
names(cols) <- sort(d_meta2$family2)

# define colours for the different habitats
cols_hab <- met.brewer('Austria', n = 7)
names(cols_hab) <- c('marine_mud', 'freshwater', 'terrestrial', 'freshwater:terrestrial', 'generalist', 'marine_mud:terrestrial', 'freshwater:marine_mud')
hab_labels <- c('marine mud', 'freshwater', 'terrestrial', 'freshwater + terrestrial', 'generalist', 'marine mud + terrestrial', 'freshwater + marine mud')

# remove any tip labels not in our tree from d_meta
d_meta <- filter(d_meta, tip_label %in% tree_bamm$tip.label)

# make a separate column for the rare states to make their size bigger!
d_meta <- mutate(d_meta, rare = ifelse(habitat_preference %in% c('generalist', 'marine_mud:terrestrial'), 'rare', 'common'))

# plot tree using ggtree
# first colour branches and add rate shifts
p1 <- ggtree(tree_bamm, layout = 'circular', branch.length = 'none', aes(col = log_edge_length)) %<+% d_tree_bamm +
scale_color_gradientn('Net diversification (branch colours)', colors = met.brewer(name='Hiroshige', direction=-1, override.order = F), breaks=c(min(d_tree_bamm$log_edge_length, na.rm = TRUE) + abs(min(d_tree_bamm$log_edge_length, na.rm = TRUE))*0.2, max(d_tree_bamm$log_edge_length, na.rm = TRUE) * 0.95), labels=c("Slow","Fast")) +
geom_point2(aes(subset=(node %in% shiftnodes)), color="black",size=5)

# next add tip points
p2 <- p1 %<+% d_meta +
  new_scale_color() +
  geom_tippoint(aes(x=x+5, col = habitat_preference, size = rare), position = position_jitter(width = 3, height = 0)) +
  scale_color_manual('Habitat preference (tip points)', values = cols_hab, labels = hab_labels) +
  scale_size_manual(values = c(0.6, 2)) +
  guides(color = guide_legend(override.aes = list(size = 5)),
         size = 'none')

p2 +
  new_scale_colour() +
  geom_cladelab(data = d_meta2,
                mapping = aes(node = mrca,
                              color = family2,
                              label = blank_label),
                offset = 10,
                barsize = 2) +
  scale_color_manual('Family (outer bar)', values = cols) +
  guides(color = guide_legend(override.aes = list(size = 5)))

Code
ggsave(here('plots/sequencing_rpoB/analyses/bamm_tree.png'), last_plot(), height = 9, width = 12)

Rate variation through time

We can look at diversification rates change through time to see if diversification in general is faster or slower deeper in the tree, and therefore further back in evolutionary time.

We can also grab the rates through time for all the sub-trees after a rate shift to see how they compare. To do this we need to write our own function to process the data and get it ready for plotting.

Code
# write function to get rate through time into the correct format
get_rate_through_time_df <- function(ephy, ...){
  rtt <- getRateThroughTimeMatrix(ephy, ...)
  
  # get dataframe of each part
  # first speciation
  rtt_sp <- rtt$lambda %>%
  data.frame() %>%
  mutate(sample = 1:n()) %>%
  pivot_longer(starts_with('X'), names_to = 'time_point', values_to = 'speciation') %>%
  mutate(time_point = parse_number(time_point)) %>%
  group_by(sample) %>%
  mutate(time = unname(rtt$times)) %>%
  ungroup()
  
  # second extinction
  rtt_ex <- rtt$mu %>%
  data.frame() %>%
  mutate(sample = 1:n()) %>%
  pivot_longer(starts_with('X'), names_to = 'time_point', values_to = 'extinction') %>%
  mutate(time_point = parse_number(time_point)) %>%
  group_by(sample) %>%
  mutate(time = unname(rtt$times)) %>%
  ungroup()
  
  rtt_comb <- left_join(rtt_sp, rtt_ex)
  
  rtt_comb <- mutate(rtt_comb, net_div = speciation - extinction) %>%
    pivot_longer(cols = c(speciation, extinction, net_div), names_to = 'process', values_to = 'rate')
  
  return(rtt_comb)
}

# get rate through time estimates for the whole tree
rtt_all <- get_rate_through_time_df(ephy = edata)

# create means
rtt_combine_means <- group_by(rtt_all, time_point, process, time) %>%
  summarise(ave_rate = mean(rate), .groups = 'drop',
            lower_ci = quantile(rate, 0.025),
            upper_ci = quantile(rate, 0.975))

# get rate through time estimates for each shift node
rtt_shift <- tibble(shift_node = shiftnodes,
                    n = 1:length(shiftnodes)) %>%
  nest(data = shift_node) %>%
  mutate(temp = purrr::map(data, possibly(~get_rate_through_time_df(ephy = edata, node = .x$shift_node, nodetype = 'include'), otherwise = NA_real_)),
         is_tib = purrr::map_dbl(temp, is_tibble))

rtt_shift_nowork <- filter(rtt_shift, is_tib == 0) %>%
  dplyr::select(data) %>% unnest(data)
rtt_shift2 <- filter(rtt_shift, is_tib == 1) %>%
  dplyr::select(data, temp) %>%
  unnest(data) %>%
  unnest(temp)

# create means
rtt_shift_means <- group_by(rtt_shift2, time_point, process, time, shift_node) %>%
  summarise(ave_rate = mean(rate), .groups = 'drop',
            lower_ci = quantile(rate, 0.025),
            upper_ci = quantile(rate, 0.975))

# create a plot
ggplot() +
  geom_line(aes(time, ave_rate, group = shift_node), rtt_shift_means, col = 'dark grey') +
  geom_ribbon(aes(time, ymin = lower_ci, ymax = upper_ci, group = shift_node), alpha = 0.01, rtt_shift_means, fill = 'dark grey') +
  geom_line(aes(time, ave_rate), rtt_combine_means) +
  geom_ribbon(aes(time, ymin = lower_ci, ymax = upper_ci), alpha = 0.1, rtt_combine_means) +
  facet_wrap(~process, scales = 'free') +
  theme_bw(base_size = 14) +
  labs(x = 'Relative time',
       y = 'rate')

Code
# save plot out
ggsave(here('plots/sequencing_rpoB/analyses/bamm_rate_through_time.png'), last_plot(), height = 5, width = 12)

You can see that after a shift node there is generally a rapid acceleration in diversification and speciation rate, after which we see a decline, whereas extinction rate rises quickly and then plateaus through time. The uncertainty around these node specific estimates is very high, but the general pattern appears to hold.

Looking at tip-specific evolutionary rates

We can also estimate tip-specific evolutionary rates. We can plot these across our different traits to see if there is anything particularly interesting going on. This could be complemented by our models looking at state-dependent character diversification rates.

Code
# grab out tip rates 
tip_rates <- data.frame(tip_label = edata$tip.label,
                        speciation = edata$meanTipLambda,
                        extinction = edata$meanTipMu) %>%
             left_join(., dplyr::select(d_meta, tip_label, habitat_preference)) %>%
             filter(!is.na(habitat_preference)) %>%
             group_by(habitat_preference) %>%
             mutate(n = n()) %>%
             ungroup() %>%
             mutate(hab_pref_axis = gsub(':', '/ ', habitat_preference),
                    hab_pref_axis = gsub('_', ' ', hab_pref_axis),
                    net_diversification = speciation - extinction)

# check numbers are right
group_by(tip_rates, habitat_preference) %>%
  tally() %>%
  arrange(n)
# A tibble: 7 × 2
  habitat_preference                    n
  <chr>                             <int>
1 freshwater:marine_mud:terrestrial     4
2 marine_mud:terrestrial                6
3 freshwater:marine_mud               116
4 freshwater:terrestrial              482
5 marine_mud                          567
6 terrestrial                         704
7 freshwater                          742
Code
p1 <- ggplot(tip_rates, aes(forcats::fct_reorder(hab_pref_axis, n), speciation)) +
  MicrobioUoE::geom_pretty_boxplot(col='black', fill = 'black') +
  geom_point(shape = 21, fill = 'white', position = position_jitter(width = 0.2)) +
  theme_bw(base_size = 12) +
  scale_x_discrete(labels = scales::label_wrap(13)) +
  labs(x = 'Habitat preference',
       y = 'Speciation rate')

p2 <- ggplot(tip_rates, aes(forcats::fct_reorder(hab_pref_axis, n), extinction)) +
  MicrobioUoE::geom_pretty_boxplot(col='black', fill = 'black') +
  geom_point(shape = 21, fill = 'white', position = position_jitter(width = 0.2)) +
  theme_bw(base_size = 12) +
  scale_x_discrete(labels = scales::label_wrap(13)) +
  labs(x = 'Habitat preference',
       y = 'Extinction rate')

p3 <- ggplot(tip_rates, aes(forcats::fct_reorder(hab_pref_axis, n), net_diversification)) +
  MicrobioUoE::geom_pretty_boxplot(col='black', fill = 'black') +
  geom_point(shape = 21, fill = 'white', position = position_jitter(width = 0.2)) +
  theme_bw(base_size = 12) +
  scale_x_discrete(labels = scales::label_wrap(13)) +
  labs(x = 'Habitat preference',
       y = 'Net diversification rate')

p1 + p2 + p3

Code
# save plot out
ggsave(here('plots/sequencing_rpoB/analyses/bamm_tip_rates.png'), last_plot(), height = 5, width = 17)

There does not appear to be much variation between traits in their speciation and extinction rates. If we wanted to follow this further there are lots of papers we could look at. First and foremost would be to read Title & Rabosky’s paper entitled “Tip rates, phylogenies and diversification: What are we estimating, and how good are the estimates?”. They also share other papers that have used tip rate estimates to look at diversification rates across geographical and environmental gradients and across different traits.

We can calculate the DR statistic (also known as tip DR) which is an tip-level estimate of the speciation rate. This measure is non-model based, and incorporates the number of splitting events and the internode distances along the root-to-tip path of a phylogeny, while giving greater weight to branches closer to the present. This was first implemented by Jetz et al. (2012) in their Nature paper about bird diversity in space and time.

Code
# compute tip DR
d_tipdr <- DR_statistic_C(tree)

# put this into a dataframe
d_tipdr <- data.frame(tip_label = names(d_tipdr),
                      tip_dr = unname(d_tipdr)) %>%
           separate(., tip_label, c('part1', 'part2', 'part3'), sep = '_') %>%
           unite('tip_label', c(part1, part2), sep = '_') %>%
           dplyr::select(-part3) %>%
           left_join(., dplyr::select(d_meta, tip_label, habitat_preference)) %>%
           filter(!is.na(habitat_preference)) %>%
           group_by(habitat_preference) %>%
           mutate(n = n()) %>%
           ungroup() %>%
           mutate(hab_pref_axis = gsub(':', '/ ', habitat_preference),
                  hab_pref_axis = gsub('_', ' ', hab_pref_axis))

ggplot(d_tipdr, aes(forcats::fct_reorder(hab_pref_axis, n), tip_dr)) +
  MicrobioUoE::geom_pretty_boxplot(col='black', fill = 'black') +
  geom_point(shape = 21, fill = 'white', position = position_jitter(width = 0.2)) +
  theme_bw(base_size = 14) +
  scale_x_discrete(labels = scales::label_wrap(13)) +
  labs(x = 'Habitat preference',
       y = 'tip DR (speciation rate)')

This method also shows very little variation in the estimated tip-level speciation rate between traits. It does look like marine mud may have a faster speciation rate than the others. This is interesting because the results from BAMM also indicate marine mud has a higher speciation rate.

This again is something to discuss further with Rutger as I am not sure exactly where to go with the remainder of this analysis. These sorts of statistics may be very useful where fitting of SSE models with rate changes within the state are not possible. This may be something to consider with our dataset where the number of parameters in the MuHiSSE would be so big.

We can can do phylogenetic generalised least squares regressions to account for phylogentic relatedness concerning differences in diversification rate between species with different habitat preferences. This approach has been used by Jetz et al. (2012) in their Nature paper about bird diversity in space and time.

We will first do this using nlme::lme() as described in Liam Revell’s and Luke Harmon’s book “Phylogenetic Comparative Methods in R”.

Code
# remove family name from the tip label
d_labels <- data.frame(tip_label = tree$tip.label) %>%
  separate(., tip_label, c('part1', 'part2', 'part3'), sep = '_', remove = FALSE) %>%
  unite('tip_label_new', c(part1, part2), sep = '_')

tree$tip.label <- d_labels$tip_label_new

# set up correlation matrix for the tree
cor_lambda <- corPagel(value = 1, phy = tree, form = ~tip_label)

# fit phylogenetic generalised linear model
mod <- gls(tip_dr ~ habitat_preference, data = d_tipdr, correlation = cor_lambda)

# do contrasts between habitat preferences
contrasts <- emmeans(mod, pairwise ~ habitat_preference)

# save out mean branch lengths
saveRDS(mod, here('data/sequencing_rpoB/processed/tipDR_pgls.rds'))
saveRDS(contrasts, here('data/sequencing_rpoB/processed/tipDR_pgls_contrasts.rds'))
Code
# remove family name from the tip label
d_labels <- data.frame(tip_label = tree$tip.label) %>%
  separate(., tip_label, c('part1', 'part2', 'part3'), sep = '_', remove = FALSE) %>%
  unite('tip_label_new', c(part1, part2), sep = '_')

tree$tip.label <- d_labels$tip_label_new

# set up correlation matrix for the tree
cor_lambda <- corPagel(value = 1, phy = tree, form = ~tip_label)

# read in model
mod <- readRDS(here('data/sequencing_rpoB/processed/tipDR_pgls.rds'))

summary(mod)
Generalized least squares fit by REML
  Model: tip_dr ~ habitat_preference 
  Data: d_tipdr 
       AIC      BIC    logLik
  18964.28 19019.81 -9473.141

Correlation Structure: corPagel
 Formula: ~tip_label 
 Parameter estimate(s):
   lambda 
0.8277129 

Coefficients:
                                                 Value Std.Error    t-value
(Intercept)                                  2.2219425 0.8119547  2.7365349
habitat_preferencefreshwater:mud_and_shore  -0.5490924 0.3310935 -1.6584210
habitat_preferencefreshwater:terrestrial    -0.3239081 0.1722226 -1.8807528
habitat_preferencegeneralist                 0.0236776 0.7295473  0.0324552
habitat_preferencemud_and_shore              0.0886626 0.2694823  0.3290108
habitat_preferencemud_and_shore:terrestrial  0.4438919 0.7467982  0.5943933
habitat_preferenceterrestrial               -0.3474218 0.1863011 -1.8648405
                                            p-value
(Intercept)                                  0.0062
habitat_preferencefreshwater:mud_and_shore   0.0973
habitat_preferencefreshwater:terrestrial     0.0601
habitat_preferencegeneralist                 0.9741
habitat_preferencemud_and_shore              0.7422
habitat_preferencemud_and_shore:terrestrial  0.5523
habitat_preferenceterrestrial                0.0623

 Correlation: 
                                            (Intr) hb_:__ hbtt_: hbtt_prfrncg
habitat_preferencefreshwater:mud_and_shore  -0.063                           
habitat_preferencefreshwater:terrestrial    -0.123  0.285                    
habitat_preferencegeneralist                -0.027  0.058  0.140             
habitat_preferencemud_and_shore             -0.086  0.204  0.348  0.082      
habitat_preferencemud_and_shore:terrestrial -0.033  0.069  0.143  0.039      
habitat_preferenceterrestrial               -0.123  0.252  0.639  0.129      
                                            hbt___ hb___:
habitat_preferencefreshwater:mud_and_shore               
habitat_preferencefreshwater:terrestrial                 
habitat_preferencegeneralist                             
habitat_preferencemud_and_shore                          
habitat_preferencemud_and_shore:terrestrial  0.174       
habitat_preferenceterrestrial                0.341  0.141

Standardized residuals:
       Min         Q1        Med         Q3        Max 
-0.2819430  0.1544749  0.5076096  1.1481052  7.6418458 

Residual standard error: 5.219937 
Degrees of freedom: 3540 total; 3533 residual
Code
# read in contrasts
contrasts <- readRDS(here('data/sequencing_rpoB/processed/tipDR_pgls_contrasts.rds'))

So the model fits and is ok. The estimate of Pagel’s lambda - the phylogenetic signal - is 0.83 which is quite high. There are no significant differences between any of the habitat preferences in terms of their speciation rate though.

Defining each tip as either high diversification or low diversification

One of the main questions we are interested in is whether we see a increase in the diversification rate AFTER a transition, but this is a difficult thing to test using either the BAMM model or state-dependent speciation and extinction models.

One alternative might be to create a new variable of high and low diversification rate for each tip, and then combine this with habitat preference when we explore transition rate models.

We can look at the distribution of tip rates, irregardless of habitat preference.

Code
ggplot(tip_rates, aes(net_diversification)) +
  geom_histogram(fill = 'white', col = 'black') +
  theme_bw()

There is a hint that this model is bimodal. We can run a k-means clustering on it to try and split it into two groups.

Code
# do very crude kmeans clustering
mod_clust <- kmeans(tip_rates$net_diversification, 2, algo="Lloyd")

# add this to our dataset
tip_rates <- mutate(tip_rates, cluster = as.character(mod_clust$cluster),
                    div_rate = ifelse(cluster == '1', 'low', 'high'))

ggplot(tip_rates, aes(net_diversification, fill = div_rate)) +
  geom_histogram(col = 'black') +
  theme_bw(base_size = 14) +
  theme(legend.position = c(0.8, 0.8))

We can count the number of each combination of diversification rate (high or low) and each habitat preference to see whether we have some common and rare states.

Code
# check number in each diversity rate bin: high and low
group_by(tip_rates, div_rate) %>%
  tally()
# A tibble: 2 × 2
  div_rate     n
  <chr>    <int>
1 high        93
2 low       2528
Code
# check per diversity rate bin/habitat preference combination
group_by(tip_rates, habitat_preference, div_rate) %>%
  tally() %>%
  group_by(habitat_preference) %>%
  mutate(prop = round(n/sum(n), 2)) %>%
  ungroup()
# A tibble: 11 × 4
   habitat_preference                div_rate     n  prop
   <chr>                             <chr>    <int> <dbl>
 1 freshwater                        high        17  0.02
 2 freshwater                        low        725  0.98
 3 freshwater:marine_mud             low        116  1   
 4 freshwater:marine_mud:terrestrial low          4  1   
 5 freshwater:terrestrial            high        29  0.06
 6 freshwater:terrestrial            low        453  0.94
 7 marine_mud                        high         3  0.01
 8 marine_mud                        low        564  0.99
 9 marine_mud:terrestrial            low          6  1   
10 terrestrial                       high        44  0.06
11 terrestrial                       low        660  0.94

Ok this looks semi-sensible. We will now save this dataset out to try and fit some transition models to the habitat preference by diversification rate bin.

Code
saveRDS(dplyr::select(tip_rates, tip_label, div_rate), here('data/sequencing_rpoB/processed/div_rate_bins.rds'))